feat: Add attachment download support for issues and pull requests #439
Merged
josegonzalez merged 2 commits into josegonzalez:master on Nov 6, 2025
Contributor (Author):
I noticed I did not replicate the atomic write technique of os.rename for either the manifest or attachment binary. Will try to get to that.
Contributor (Author):
Done. Added atomic file writes with cleanup on exception. Retested on multiple repos, all working correctly. Future: Would like to add pytest-based tests (dev dependency only).
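For readers unfamiliar with the technique mentioned in these comments, here is a minimal sketch of an atomic write with cleanup on exception. It is not the code from this PR: the function name `atomic_write_bytes` is hypothetical, and it uses `os.replace` rather than `os.rename` because `os.replace` also overwrites an existing target on Windows.

```python
import os
import tempfile


def atomic_write_bytes(path, data):
    """Write data to path so readers never observe a partially written file.

    Hypothetical helper for illustration; not taken from github_backup.py.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the target so the final rename stays on
    # one filesystem (a cross-device rename would not be atomic).
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
            handle.flush()
            os.fsync(handle.fileno())
        # Swap the finished file into place in one step.
        os.replace(tmp_path, path)
    except BaseException:
        # Remove the partial temp file, then re-raise the original error.
        try:
            os.remove(tmp_path)
        except OSError:
            pass
        raise
```

The same helper would serve both the manifest and the attachment binaries: an interrupted run leaves either the previous file or the complete new one, never a truncated mix.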
This was referenced Nov 3, 2025
Owner:
Mind rebasing this onto master?
Adds new --attachments flag that downloads user-uploaded files from issue and PR bodies and comments.

Key features:
- Determines attachment URLs
- Tracks downloads in manifest.json with metadata
- Supports --skip-existing to avoid re-downloading
- Handles filename collisions with counter suffix
- Smart retry logic for transient vs permanent failures
- Uses Content-Disposition for correct file extensions
Contributor (Author):
Done
Add --attachments flag to download issue and PR attachments
Fixes #399
Problem
User-uploaded attachments (images, videos, documents) referenced in issues and PRs are not currently backed up. Only their URLs are preserved in the JSON files. If a repository or account is deleted, these attachments are permanently lost - the URLs become inaccessible, and critical context like screenshots, diagrams, and documentation is gone forever.
Solution
Adds the --attachments flag to download binary attachments alongside issue/PR data.
The solution turned out to be harder than you would think! There is a lot to do to determine which URLs are attachments and then fetch them. However, after doing all that, usage comes down to a single flag. Makes it look so easy!
What Qualifies as an Attachment?
GitHub's API doesn't have an "attachment" concept - it's a user behavior pattern. Through analysis of real-world repositories, attachments are defined as:
Note that this approach also captures bare URLs pasted by users, not just markdown image syntax. It produces occasional false positives: example URLs such as https://github.com/user-attachments/assets/abc12345-6789-4def-ghij-klmnopqrstuv are extracted and then fail with HTTP 404. Every failure is logged in the manifest file.
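As a rough illustration of this kind of extraction, the sketch below scans a markdown body for the URL families listed under URL Format Support. It is not the PR's implementation: the pattern `ATTACHMENT_URL` and the function `extract_attachment_urls` are hypothetical, and the repository filter is a plain case-insensitive comparison that would not recognise a renamed repository.

```python
import re

# Hypothetical pattern for the URL families named in this PR description.
ATTACHMENT_URL = re.compile(
    r"https://(?:"
    r"github\.com/user-attachments/(?:assets|files)/"
    r"|github\.com/[\w.-]+/[\w.-]+/(?:files|assets)/"
    r"|(?:private-)?user-images\.githubusercontent\.com/"
    r")[^\s<>()\[\]\"']+"
)


def extract_attachment_urls(body, owner, repo):
    """Return unique attachment-looking URLs from a markdown body, in order."""
    found = []
    for match in ATTACHMENT_URL.finditer(body or ""):
        # Bare URLs at the end of a sentence pick up trailing punctuation.
        url = match.group(0).rstrip(".,;:!?")
        parts = url.split("/")  # ['https:', '', host, segment, segment, ...]
        is_repo_scoped = parts[2] == "github.com" and parts[3] != "user-attachments"
        if is_repo_scoped:
            # Assumption: keep repo-scoped URLs only for the repo being backed up.
            if (parts[3].lower(), parts[4].lower()) != (owner.lower(), repo.lower()):
                continue
        if url not in found:
            found.append(url)
    return found
```

Because the pattern does not depend on markdown image syntax, it picks up both `![alt](url)` references and bare pasted URLs, which is also why fake example URLs get extracted and later fail with 404.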
Implementation
- retrieve_issues() and retrieve_pulls() functions
- --skip-existing: Loads existing manifest.json and only downloads new attachments (URLs not already in manifest)
- issues/attachments/{issue_number}/ and pulls/attachments/{pull_number}/ directories

Technical Details
URL Format Support:
- github.com/{owner}/{repo}/files/* (filtered to current repo only)
- github.com/{owner}/{repo}/assets/* (filtered to current repo only)
- github.com/user-attachments/{assets,files}/*
- user-images.githubusercontent.com/* and private-user-images.githubusercontent.com/*

Smart URL Extraction:
Cross-Platform Compatibility:
Private Repository Support:
- github.com/user-attachments/* URLs require authentication for private repos
- [...] (user-images.githubusercontent.com) don't need/accept authentication
- [...] (--token, --keychain, etc.)

Incremental Backup Support:
- --skip-existing works intelligently with attachments
- [...] --skip-existing only downloads the new one, preserving the existing 3 and adding the 4th to the manifest.

Progress Logging:
- logger.info() messages matching existing tool patterns
- 2025-11-02T22:06:42.152: Downloading 1 attachment(s) for issue #42

File Conflict Handling:
- report.pdf → report_1.pdf → report_2.pdf
- [...] manifest.json. Any conflicting attachment is renamed (manifest_1.json)

Manifest Files:
Each issue/PR with attachments gets a manifest.json. For example, the manifest for py-pdf/pypdf#1034 is:

    {
      "issue_number": 1034,
      "issue_type": "issue",
      "repository": "py-pdf/pypdf",
      "manifest_updated_at": "2025-11-02T14:21:51.647658+00:00",
      "attachments": [
        {
          "url": "https://github.com/py-pdf/PyPDF2/files/9001047/test_google_sheet.pdf",
          "success": true,
          "http_status": 200,
          "content_type": "application/pdf",
          "original_filename": "test_google_sheet.pdf",
          "size_bytes": 15350,
          "downloaded_at": "2025-11-02T14:21:47.965316+00:00",
          "error": null,
          "saved_as": "test_google_sheet.pdf"
        },
        {
          "url": "https://user-images.githubusercontent.com/1658117/176374973-baf90c95-a3bd-4a8a-9823-5fbbe7f31f69.png",
          "success": true,
          "http_status": 200,
          "content_type": "image/png",
          "original_filename": "176374973-baf90c95-a3bd-4a8a-9823-5fbbe7f31f69.png",
          "size_bytes": 156036,
          "downloaded_at": "2025-11-02T14:21:49.943060+00:00",
          "error": null,
          "saved_as": "176374973-baf90c95-a3bd-4a8a-9823-5fbbe7f31f69.png"
        }
      ]
    }

Testing
Tested on 2 major repositories with production-scale data:
[1] Availability rate: Percentage of attachment URLs that were successfully downloaded. Failures are typically deleted attachments, expired JWT tokens, or example GitHub attachment URLs written outside of a code block. These don't show up on the web version of the issue either. See "About Attachments" in the README for further details.
[2] The 81.8% rate for python-github-backup is expected: the 2 failed downloads are fake example URLs (e.g., https://github.com/user-attachments/assets/abc12345-...) posted in issue #399 to demonstrate URL patterns. They were correctly extracted and attempted, returning 404 as expected for non-existent URLs.

Cases Tested:
- github.com/{owner}/{repo}/files/* (filtered to current repo only)
- github.com/{owner}/{repo}/assets/* (filtered to current repo only)
- --skip-existing (only downloads new attachments)

Cross-Platform:
Backward Compatibility
Files Changed
- github_backup/github_backup.py (+637/-3): Core attachment download implementation with manifest tracking, collision detection, and smart retry logic
- README.rst (+30/-0): Documentation for --attachments flag and manifest format

Thank you for considering this PR as an addition to an already great tool!